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A fast and robust method for determining the parameters for 
a flat (mask-based) bulk-solvent model and overall scaling 
in macromolecular crystallographic structure refinement and 
other related calculations is described. This method uses 
analytical expressions for the determination of optimal values 
for various scale factors. The new approach was tested using 
nearly all entries in the PDB for which experimental structure 
factors are available. In general, the resulting R factors are 
improved compared with previously implemented approaches. 
In addition, the new procedure is two orders of magnitude 
faster, which has a significant impact on the overall runtime of 
refinement and other applications. An alternative function 
is also proposed for scaling the bulk-solvent model and it is 
shown that it outperforms the conventional exponential 
function. Similarly, alternative methods are presented for 
anisotropic scaling and their performance is analyzed. All 
methods are implemented in the Computational Crystallo- 
graphy Toolbox (cctbx) and are used in PHENIX programs. 

1. Introduction 

Macromolecular crystals typically contain a substantial 
amount of disordered solvent, ranging from approximately 
20 to 90% of the crystal volume, with a mean of 55%, in the 
Protein Data Bank (PDB; Bernstein et al, 1977; Berman et al, 
2000). Anisotropy in the diffracted intensities is another 
common feature of macromolecular crystals that arises from 
various sources including crystal lattice vibrations (Shakked, 
1983; Sheriff & Hendrickson, 1987). When modelling 
diffracted intensities, for example in structure refinement or 
automated model building, it is therefore critical to account 
for these two phenomena (see, for example, Jiang & Briinger, 
1994; Urzhumtsev & Podjarny, 1995; Kostrewa, 1997; Badger, 
1997; Urzhumtsev, 2000; Fokine & Urzhumtsev, 2002fl; Fenn et 
al., 2010). The flat bulk-solvent model (Phillips, 1980; Jiang & 
Briinger, 1994) combined with overall anisotropic scaling in 
either exponential (Sheriff & Hendrickson, 1987) or poly- 
nomial (Uson et al., 1999) forms is a weU estabhshed and 
computationally efficient approach. Alternatives have been 
proposed (Tronrud, 1997; Vassylyev et al, 2007), but are not 
currently in wide use. 

In the commonly used approach, the total structure factor is 
defined as 



Received 1 3 December 201 2 
Accepted 5 January 201 3 



^model — ^total(J^calc 



F ^ 

mask mask/' 



(1) 



where fctotai is the overall Miller-index-dependent scale factor, 
Fcaic and F^ask are the structure factors computed from the 
atomic model and the bulk-solvent mask, respectively, and 
^mask is a bulk-solvent scale factor. The mask can be computed 
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efficiently using exact asymmetric units as described in 
Grosse-Kunstleve et al. (201f ). 

The overall scale factor fc,otai can be thought of as the 
product 

(2) 



k —k k k 

total ^overall isotropic anisotropic ' 



where ^overall is the overall scale factor and ^isotropic and 
'i^anisotropic are the isotropic and anisotropic scale factors, 
respectively. 

^overau IS a scalar number that can be obtained by mini- 
mizing the least-squares residual 

LS = 2Z(^obs ~ ^overalllFmodell) ' (3) 

where Fobs are the observed structure factors and 

^model ^isotropic ^anisotropic (^calc ^mask^mask)* (^) 

The sum is over all reflections. Solving 9LS/9A:overaii - 0 leads to 

^overall — Z^^obsl^'modell/ l^odell • (5) 

In the exponential model the anisotropic scale factor is 
defined as 



= exp(-2;r's'U^3tS), 



(6) 



"'amsotropic ^^FV ^cryst' 

where Ucryst is the overall anisotropic scale matrix equivalent 
to U* defined in Grosse-Kunstleve & Adams (2002); s' = {h, k, I) 
is the transpose of the Miller-index column vector s. 

Uson et al. (1999) define a polynomial anisotropic scaling 
function that can be rewritten in matrix notation as follows: 

^anisotropic — * VgS + (s 'VjS)j' , (7) 

where Vq and Vi are symmetric 3x3 matrices, = s'G*s and 
G* is the reciprocal-space metric tensor. Expression (7) is 
equivalent to the first terms in the Taylor series expansion of 

the exponential function (6), 

exp(-27r^s'U,^,,s) ~ 1 - 2;r^s'U,^,,s + 27r\s'U,,y,tS)(s'U„y,,s), 

(8) 

with the constant term omitted. The omission of the constant 1 
means that ^anisotropic is equal to zero for the reflection Fqoo, as 
follows from (7). Therefore, in this work we modify (7) by 
adding the constant 

*anisotropic = l+s'VoS+(s'ViS)/. (9) 

The bulk-solvent scale factor is traditionally defined as 

^mask = ^soi exp(-5,<„sV4), (10) 

where k^ox and 5soi are the flat bulk-solvent model parameters 
(Philhps, 1980; Jiang & Briinger, 1994; Fokine & Urzhumtsev, 
20025). 

Depending on the calculation protocol, /fisotropic may be 
assumed to be a part of /Canisotropic or it can be assumed to be 
exponential: ^isotropic - exp(— Ss^/4), where 5 is a scalar 
parameter. Alternatively, it may be determined as described in 
§2.3 below. 

The determination of the anisotropic scaling parameters 

(Ucryst 

or Vo and Vi) and the bulk-solvent parameters k^ox and 
5soi requires the minimization of the target function (3) with 



respect to these parameters. Despite the apparent simplicity, 
this task is quite involved owing to a number of numerical 
issues (Fokine & Urzhumtsev, 20026; Afonine et al, 2005a). 
Previously, we have developed a robust and thorough proce- 
dure (Afonine et al, 2005fl) to address these issues. This 
procedure is used routinely in PHENIX (Adams et al, 2010). 
However, owing to its thoroughness the procedure is relatively 
slow and may account for a significant fraction of the execu- 
tion time of certain PHENIX applications (for example, 
phenix.refine). 

In this paper, we describe a new procedure which is 
approximately two orders of magnitude faster than the 
approach described in Afonine et al (2005a) and often leads 
to a better fit of the experimental data. The speed gain is the 
result of an analytical determination of the optimal bulk- 
solvent and scaling parameters. The better fit to the experi- 
mental data is partially the result of employing a more detailed 
model for A;„iask compared with the exponential model in 
equation (10) and is partially a consequence of the new 
analytical optimization method. Analytical optimization 
eliminates the possibility of becoming trapped in local minima, 
which exists in all iterative local optimization methods, 
including the procedure used previously. 

2. Methods 

2.1 . Anisotropic scaling: exponential model 

To obtain the elements of the anisotropic scaling matrix (6), 
the minimization of (3) is replaced by the minimization of 



LSL = i:[ln(F„J-ln(|F^„,,,|)f. 



(11) 



For this, we assume that .Fobs and |F,„odeil are positive. We also 
assume that the minima of (3) and (11) are at similar locations. 
This assumption is not obvious and, as discussed below, may 
not always hold (see §3.3 and Table 2). Expression (11) can be 
rewritten as 



LSL = (2;r^)'n2 + s'U,^stS)'- 

s 

Here, Z= [l/(2;r^)]ln[Foi,s(A:overall^isotropic|Fcalc + ^maskFmaskl 

Defining 



LSL = LSL/(2;r^)^ 



and using 



cryst 







f/l3 


Ux2 




U23 




U23 


U33 



(12) 
1)-^]. 

(13) 
(14) 



the target function determining the optimal Ucryst is 
LSL = ^(Z -I- Uuh^ + U22k^ + U^^l^ 

S 

+ lU^^hk + lU^^hl + lU^^klf . 



(15) 



The Ucryst values that minimize (15) are determined from the 
condition V^LSL = 0, which gives a system of six linear 
equations 
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MU„y,. = b. (16) 

where M = V (g) V, V = (h^, k^, 2hk, 2hl, lkl)\ (g) denotes 
the outer product and b = — ZV. 

The desired Ucryst matrix is determined by solving the 
system (16): 

= M-'b. (17) 

Crystal-system-specific symmetry constraints can be incor- 
porated via a constraint matrix (C), which we derive from first 
principles by solving the system of linear equations R'UR = U 
for all rotation matrices R of the crystal-system point group. 
Alternatively, symmetry constraints are often derived manu- 
ally and tabulated (Nye, 1957; Giacovazzo, 1992). For 
example, the constraint matrix for the tetragonal crystal 
system is 

/I 1 0 0 0 0\ 

^-(,0 0 1 0 0 oj- ^^^^ 

The number of rows in C determines the number of inde- 
pendent coefficients of Ucryst- Let Ujnd be the column vector 
of independent coefficients; the (redundant) set of six coeffi- 
cients Ucryst is then obtained via 

Ucryst = (t^ll U,, f/33 U,, t/i3 f/23)=C'Ui„,. (19) 

The constraint matrix C is introduced into equations (16) and 
(17) above as follows: 

McU„d = be (20) 
with Mc = Eh Vc <H) Vc, Vc = CV, be = - Eh Z\c and 

U,„, = Mc'bc- (21) 
The full Ucryst is then determined via equation (19). 

2.2. Anisotropic scaling: polynomial model 

The polynomial model (Uson et al, 1999) for anisotropic 
scahng allows the direct use of the residual (3) to find the 
optimal coefficients for Vq and Vi in equation (9). An 
advantage of this model is that no assumptions about the 
similarity of the location of the minima of targets (3) and (11) 
are required. Conceptually, a disadvantage of equation (9) is 
that it is only an approximation of equation (6), as was shown 
above. However, the number of parameters is doubled in 
equation (9) compared with equation (6), since Vo and Vi are 
treated independently. The increased number of degrees of 
freedom may therefore compensate for approximation in- 
accuracies. 

Similarly to §2.1, the optimal coefficients for Vo and Vi are 
determined by the condition VyLS = 0 and can be obtained by 
solving a system of 12 linear equations. We follow the argu- 
ments of Uson et al. (1999) for not using symmetry constraints 
in this case. 

2.3. Bulk-solvent parameters and overall isotropic scaling 

Defining K = ^total ~ (^overall ^isotropic ^anisotropic) > 

determination of the desired scaling parameters Anisotropic and 
A^mask is reduced to minimizing 



l^XK, k^s^) = EdFcalc + ^maskFmaskl' ' Klf (22) 

in resolution bins, where /Coverau and Aranisotropic are fixed. This 
minimization problem is generally highly overdetermined 
because the number of reflections per bin is usually much 
larger than two. 

Introducing w = |F,„3skP, v = (Fcaic, F„ask) and u = \F^^f and 
substitution into (22) leads to 

LS,(i^, k^J = E[(^LskW + 2k^^v + u)- Klf. (23) 

s 

Minimizing (23) with respect to K and k,^^ leads to a system 
of two equations: 

— LS,(A:, /C„ask) = - ElC^askWj + 2^maskV, + U,) - KQI, 
= 0 ' 

3 

— LS,(^:, fc„ask) = 2 E[(^Lsk>^. + 2fc™askV, + - KQ 

X (^maskW, + Vj = 0. 

(24) 

Developing these equations with respect to femask^ 

^mask E ^sl, + 2fcmask E + E ".-^^ " E = 0 

,V .V S S 

s s s 

+ 'EusVs-Kj2l,v,^0, 

s s 

(25) 

and introducing new notations for the coefficients, we obtain 

{^maskQ + ^mask-B2 + ^2 " = 0 

kLs^D, + ^LskQ + k^^^iB, - KC2) +A,- KY, = 0. 

(26) 

Multiplying the second equation by Y2 and substituting KY2 
from the first equation into the new second equation, we 
obtain a cubic equation 

kLs^iD,Y2 - Cl) + Csk(C3>'2 - C2B2 - C2Y,) (27) 
+ k^^^^{B,Y2 - C2A2 - Y.B^) + {A,Y2 - Y.A^) = 0. 

The senior coefficient in (27) satisfies the Cauchy-Schwarz 
inequality: 

£'3>'2-C^ = Eh'?E^-E'^AEV,>0. (28) 

s s s s 

Therefore, equation (27) can be rewritten as 

^mask + «^Lsk + ^^mask + C = 0 (29) 

and solved using a standard procedure. 

The corresponding values of K are obtained by substituting 
the roots of equation (29) into the first equation in (26): 

K = (le^^^C^ + k^^B2 + A^j/Y^. (30) 

If no positive root exists /Cmask is assigned a zero value, which 
imphes the absence of a bulk-solvent contribution. If several 
roots with /Cmask > 0 exist then the one that gives the smallest 
value of LSj(^C, /cmask) is selected. 
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If desired, one can fit the right-hand side of expression (10) 
to the array of /Cmask values by minimizing the residual 

LS = El'^mask - ^soi exp(-S,„i*V4)]^ (31) 

for all A;n,ask > 0. This can be achieved analytically as described 
in Appendix A. Similarly, one can fit A:overaiiexp(— Boveraii*^/4) 
to the array of K values. 

2.4. Presence of twinning 

In case of twinning with twin-related domains, the total 
model intensity is 



7=1 



(32) 



where oij is the twin fraction of the jth domain, Ty is the 
corresponding twin operator (a 3 x 3 rotation matrix) and 

I,(T,s) = /c,„,„(Tys)|F,,„(Tys) + /c„,,,(T^.s)F^,,k(Tys)|l (33) 

ktoiai includes all scale factors (overall, isotropic and aniso- 
tropic). We make the reasonable assumption that A;,otai and 
^mask are identical for all twin domains. 

Finding the twin fractions aj can be achieved by solving the 
minimization problem 

2 

(34) 



LS(ai, . . . ,ajv) = X! 

s 

with the constraint condition 



i:«;/;(Sy)-/(s) 

7=1 



C(ai, . . . , ttiv) = L «y - 1 = 0' 



(35) 



where /(s) = Fobs and Sy = T,s. This constrained minimization 
problem can be reformulated as an unconstrained minimiza- 
tion problem by the standard technique of introducing a 
Lagrange multiplier: 

LS(ai, . . . , a/^, k) = LS(ai, . . . , a^y) + kC{ai, . . . , aj^). (36) 

The values {ai, . . . , a^. A) that minimize (36) are the solution 
of the system of N + 1 linear equations with A' + 1 variables: 

9LS(ai, . . . , 0!jY, k)/da^ = 0 



9LS(ai, . . . , ttjY, k)/daf^ — 0 
dL^la^, . . . ,af^,k)/dk = Q 



(37) 



or 



E 

s 

E 



E«/;(Sy)-/(s) 

E«y//Sj)-/(S) 



/^(Si) -H A = 0 
/^(%) + ^ = 0 



(38) 



E«y-1 = 0. 



The solution of this system is 

{a^,...,aj^,k)' = M"^b 



(39) 



with the {N + 1) x {N + 1) matrix 

M=(?7^ o)- 

and 

V=[/i(si),...,/^(s^)]. 



(40) 



(41) 



Here, 1 is a row or column containing unit elements to 
complete the matrix M and 



E/(s)/i(^i),...,E^(s)M*iv).i 



(42) 



The values of e are expected to be between 0 and 1, and k is 
proportional to the sum of squared intensities. Therefore, it is 
numerically beneficial to multiply the kC{ai, . . . , a;v) term 
in (36) by a constant Es order to make the value for k 

numerically similar to the values for the twin fractions a. 

Once the twin fractions have been found, the procedure 
described in §2.3 can be used to obtain the overall and bulk- 
solvent scale factors. Similarly to (23), we can write 

-i2 



LS,(/:, /c_,) = E 



E «;|Fcalc(Sy) + ^^maskFrnaskCS;) P 



KI 



(43) 



where Uj are known twin fractions and K and fc^ask are the 
scale factors to be determined. Similarly to §2.3, we obtain 



Ea;|Fcalc(S;) + ^maskFmask(S;)P 
/=1 



: E{«;|Fcalc(S;)l' 
/=1 



+ 2Way[F,aic(S;)F„,3k(Sy)] + /4aska;|F„ask(S;)l'}. (44) 

Introducing new variables as before for equation (23) leads to 
LS,(ii:, k^^^) = ElCCskt^ + 2/:„askV + m) - Klf. (45) 

The determination of the twin fractions a and scales /ctotai and 
^mask are iterated several times until convergence. The deter- 
mination of a does not guarantee that the individual twin 
fractions ay are in the range 0-1. For any ay outside this range 
the corresponding twin operation is ignored for the current 
iteration and the new smaller set of twin fractions and scales 
are redetermined. However, in the next iteration the full set of 
a is tried again. 

3. Results 

3.1. Implementation of the new protocol 

The scale factors involved in the calculation of Fmodei 
according to equation (1) are highly correlated. Therefore, the 
order of their determination is important. Empirically, we 
found that the determination of A:isotropic and /Cmask followed by 
the determination of /fanisotropic works optimally in most cases. 
The determination of (fc^ask, 'Cisotropic) and /Canisotropic is re- 
peated several times until the R factor decreases by less than 
0.01% between cycles. The number of cycles required to reach 
convergence is typically between 1 and 5. 
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Table 1 

Comparison of binning schemes performed witli d^^ and In(rf) spacing for three selected PDB data sets: Ikwn, Shay and 3gk8. 



All three data sets have very low completeness in the lowest resolution bin, which d^^ binning obscures while ln(J) binning makes clear even when using 
approximately half the number of bins. Completeness in the high-resolution region is similar in the two binning schemes. For each binning method three columns of 
data are presented: resolution range (A), completeness and number of reflections. 





Ikwn 








3hay 






3gk8 














Bin No. 


d-^ 




ln(d) 




d-^ 




ln{d) 


d-' 






ln(d) 








1 


19.96-3.25 


0.967 4363 


19.96- 


-7.87 0.860 301 


44.86-13.44 


0.932 715 


44.86-17.61 0.852 300 


22.18- 


-5.00 0.906 


1938 


22.18- 


-8.16 


0.610 


300 


2 


3.25-2.58 


0.997 4280 


7.87- 


-6.33 0.971 300 


13.43-10.71 


1.000 716 


17.58-14.23 1.000 301 


5.00- 


-3.98 0.994 


2052 


8.15- 


-7.00 


0.993 


300 


3 


2.58-2.26 


0.999 4214 


6.33- 


-5.10 0.966 564 


10.71-9.37 


1.000 688 


14.22-11.51 1.000 556 


3.98- 


-3.48 0.997 


2060 


7.00- 


-6.01 


0.996 


452 


4 


2.26-2.05 


1.000 4218 


5.10- 


-4.10 0.961 1037 


9.37-8.52 


1.000 693 


11.51-9.31 1.000 1011 


3.48- 


-3.16 0.995 


2051 


6.01- 


-5.16 


0.994 


700 


5 


2.05-1.90 


0.990 4135 


4.10- 


-3.30 0.986 1987 


8.52-7.91 


1.000 679 


9.31-7.53 1.000 1853 


3.16- 


-2.93 0.976 


1988 


5.16- 


-4.43 


0.993 


1087 


6 


1.90-1.79 


0.993 4133 


3.30- 


-2.66 0.997 3772 


7.91-7.45 


1.000 673 


7.53-6.10 1.000 3448 


2.93- 


-2.76 0.968 


1973 


4.43- 


-3.81 


0.996 


1735 


7 


1.79-1.70 


0.992 4119 


2.66- 


-2.14 0.999 7177 


7.45-7.08 


1.000 675 


6.10-4.99 0.997 5905 


2.76- 


-2.62 0.958 


1902 


3.81- 


-3.27 


0.996 


2716 


8 


1.70-1.63 


0.989 4070 


2.14- 


-1.72 0.993 13453 


7.08-6.77 


1.000 657 




2.62- 


-2.51 0.952 


1961 


3.27- 


-2.81 


0.979 


4149 


9 


1.63-1.57 


0.988 4094 


1.72- 


-1.38 0.990 25516 


6.77-6.51 


1.000 672 




2.51- 


-2.41 0.954 


1941 


2.81- 


-2.41 


0.955 


6410 


10 


1.57-1.51 


0.990 4093 


1.38- 


-1.20 0.989 28106 


6.51-6.29 


1.000 671 




2.41- 


-2.33 0.941 


1876 


2.41- 


-2.07 


0.931 


9748 


11 


1.51-1.46 


0.987 4036 






6.28-6.09 


1.000 657 




2.33- 


-2.26 0.933 


1897 


2.07- 


-1.85 


0.827 


9681 


12 


1.46-1.42 


0.990 4073 






6.09-5.92 


1.000 655 




2.26- 


-2.19 0.940 


1881 










13 


1.42-1.39 


0.993 4088 






5.91-5.76 


1.000 666 




2.19- 


-2.13 0.931 


1876 










14 


1.39-1.35 


0.992 4057 






5.76-5.62 


1.000 656 




2.13- 


-2.08 0.914 


1838 










15 


1.35-1.32 


0.992 4077 






5.62-5.49 


1.000 667 




2.08- 


-2.03 0.897 


1834 










16 


1.32-1.29 


0.995 4052 






5.49-5.38 


1.000 653 




2.03- 


-1.99 0.891 


1766 










17 


1.29-1.27 


0.991 4047 






5.38-5.27 


1.000 635 




1.99- 


-1.95 0.865 


1765 










18 


1.27-1.24 


0.991 4045 






5.27-5.17 


1.000 663 




1.95- 


-1.92 0.825 


1645 










19 


1.24-1.22 


0.988 4026 






5.17-5.08 


1.000 660 




1.91- 


-1.88 0.767 


1537 










20 


1.22-1.20 


0.972 3993 






5.08-4.99 


0.973 623 




1.88- 


-1.85 0.732 


1497 











To determine k^^i^of^opi^, our protocol can make use of three 
available scaling methods: polynomial (poly; §2.2), exponen- 
tial with analytical calculation of the optimal parameters 
(expanai; §2.1) and exponential with the optimal parameters 
obtained via L-BFGS (Liu & Nocedal, 1989) minimization 
(expniin; Afonine et al., 2005fl). The three methods can be 
tested independently, in which case the result with the lowest 
R factor is accepted. However, because exp^jn is up to an 
order of magnitude slower than the other two methods it is not 
expected to be used routinely. 

The calculation of /cjsotropic ^nd Ar^asi^ requires dividing the 
data into resolution bins (§3.2). If oscillation of /cmask between 
bins occurs, smoothening (Savitzky & Golay, 1964) is applied 
to the bin-wise determined values of ^mask such that it reduces 
the oscillations without altering the monotonic behavior of 
'Cmask as a functiou of resolution (see Fig. 1). Finally, the 
smoothed values are assigned to individual reflections using 
hnear interpolation. The ^isotropic scales are updated using 
equation (5) in order to account for the changed k^^^]^. 




0 0.01 0.02 0.03 0 0.01 

Figure 1 

Examples of smoothening of fe,„asij. The original k^^^k (blue; obtained as 
three PDB entries with the PDB codes shown on the plots. 



As illustrated in §3.2, the minimum of the i?-factor function 

^=E|^obs-|F„>odell|/i:i^obsl (46) 

and the minimum of the least-squares function (22) can be at 
significantly different locations in the (/Cmask^ '^^isotropic) para- 
meter space. To assure that the final (^mask^ ^isotropic) values 
correspond to the lowest R factor, a fast grid search is 
performed around the optimal values of the least-squares 
function. 

3.2. Binning 

The goal of binning is to group data by common features to 
characterize each group by a set of common parameters. Here, 
the key parameter is the resolution d of reflections. Binning 
schemes with bins containing an approximately equal number 
of reflections {i.e. the resolution range is uniformly sampled in 
d^'^) or a predefined number of bins are typically used. Since 
the low-resolution region of the data is sparse, such binning 



Ikwn 




lota 



0.02 0.03 0 0.01 0.02 0.03 

solution of equation 29) and that after smoothening (red) are shown for 



the 
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Table 2 

Comparison of U^ryst corresponding to the minima of the functions LS 
(3), LSL (11) and R factor (46). 

To improve readability, the U„ysi are shown as B values with respect to a 
Cartesian basis (Grosse-Kunstleve & Adams, 2002). To reduce the runtimes 
for the systematic parameter searches (see text), we have selected examples 
with symmetry constraints leading to all-zero off-diagonal elements. 



PDB code 


Optimization 
target 


^22, ^33 


R factor 


2fih 


R factor 


-2.15, -1.85, -1.60 


0.1935 




LS 


-4.20, -3.90, -3.35 


0.2179 


2fih (data cut at 2.5 A) 


LSL 


-2.65, -1.95, -1.60 


0.1939 


R factor 


-9.30, -10.20, -10.35 


0.2417 




LS 


-18.35, -19.65, -20.75 


0.2599 


lous (data cut at 6.5 A) 


LSL 


-38.25, -42.15, -46.20 


0.3769 


R factor 


9.25, -2.20, 4.35 


0.2082 




LS 


2.90, -2.45, 8.60 


0.2086 




LSL 


19.55, 6.55, 12.85 


0.2088 



schemes tend to produce only one or very few low-resolution 
bins, which is insufficient to best model the bulk-solvent 
contribution. Unfortunately, decreasing the number of 
reflections per bin will disproportionally increase the number 
of bins (A'bins) at higher resolution and may still provide 
insufficient detail for the low-resolution data (Table 1). 

An alternative approach which divides the resolution range 
uniformly on a logarithmic scale ln(d) (Urzhumtsev et al, 
2009) efficiently solves this problem. The flowchart of the 
algorithm is shown in Fig. 2. This scheme allows the higher 
resolution bins to contain more reflections than the lower 
resolution bins and more detailed binning at low resolution 
without increasing the total number of bins. An additional 
reason for using logarithmic binning is that the dependence of 
the scales on resolution is approximately exponential (see 
previous sections), which makes the variation of scale factors 
more uniform between bins when a logarithmic binning 
algorithm is used. Table 1 compares binning performed 
uniformly in cT^ and in \n{d) spacing for three data sets (PDB 
entries Shay, Ikwn and 3gk8). Note the data completeness of 
the low-resolution bins. 



3.3. Systematic tests 

We evaluated the performance of the new scaling protocol 
by applying it to approximately 40 000 data sets selected from 
the PDB. The structures were selected by evaluating all PDB 
entries using phenix.model_vs_data (Afonine et al., 2010) and 
excluding all entries for which the recalculated i?work was 
greater than the published value by five percentage points. 

To score the test results three crystallographic R factors (46) 
were computed using all reflections, using only low-resolution 
reflections and using only high-resolution reflections. Low- 
resolution reflections were selected using the condition rf^in > 
8 A but selecting at least the 500 lowest resolution reflections. 
High-resolution reflections were taken from the highest 
resolution bin. Each of the three anisotropic scaling methods 
(poly, expanai and exp^i^) was tested independently within 
each run. Additionally, two other tests were performed: one 
combining poly and expanai as described in §3.1 (referred to as 



poly-i-expanai) and the other using the protocol of Afonine et al. 
(2005fl) (referred to as old). 

Fig. 3 shows a comparison of the alternative methods for 
determining ^anisotropic (see §3.1). Comparing the polynomial 
model (poly) versus the analytical exponential model (expa^ai), 
with a few minor exceptions poly results in slightly lower R 
factors overaU and for the low-resolution reflections, while 
expanai rcsults iu lower R factors for the high-resolution 
reflections. Comparing poly versus the original exponential 
model using minimization (exp^ui), the R factors are very 
similar overall and for the high-resolution reflections, while 
poly often results in lower R factors for the low-resolution 
reflections. Comparing the two different exponential models, 
expmin results in lower R factors overall and nearly identical 
results for low-resolution reflections, but expanai results in 
lower R factors for the high-resolution reflections. Fig. 4 
compares the new protocol combining poly and expanai with 
the old protocol. With very few exceptions, the new protocol 
performs better for aU three resolution groups. 

As described above, occasionally the minima of the /?-factor 
function (46) and the LS function (22) are at significantly 
different locations in the (fcmask^ ^isotropic) parameter space (see 
Fig. 5). For example, considering /cisourop;,. to be a single-value 



Sort reflectioiis by resolution, 
ftom low id^) to high {d^^ 



Identify first N^^„ reflections: 



Identify next A'„i„ reflections: 
d,>d>d2 



Calculate bin step: 
A^ = taM)-ln(4) 



Estimate number of bins: 



If necessary, impose = A4i, 
A^ = [ln(rf,)-ln(4i*)]/M4. 



Compute bins ((4n, <4): 
Hd^{) = ln(,d,)-kK,k=0, 1,2, ...,N^, 



Figure 2 

Flowchart of the logarithmic resolution-binning algorithm. 
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scalar the pair (^mask. ^isotropic) that minimizes the R factor 
in the low-resolution range of PDB data set Ikwn is (0.2913, 
0.0961), while the pair (0.3218, 0.0863) minimizes the LS 
function. The corresponding R factors are 0.3073 and 0.3372, 
respectively. The data for PDB entry Ihqw lead to an even 



more dramatic difference, in which the pairs (^mask. ^isotropic) 
that minimize the R factor and the LS function are (0.25, 
0.0131) and (0.6166, 0.0151), respectively, and the corre- 
sponding R factors are 0.2924 and 0.5046. We made a similar 
observation for the overall anisotropic scale fcanisotropio as 



All reflections Low-resolution reflections only High-resolution reflections only 




0.05 0.15 0.25 0.35 0.45 0.05 0.15 0.25 0.35 0.45 0.05 0.15 0.25 0.35 0.45 

■R(exp„,|) R{e^P^d ^(expj 

(^) 

Figure 3 

A comparison of the new scaling protocol using different models for the anisotropic scale factor. R versus R factor scatter plots for (a) poly versus expa„ai, 
(b) poly versus exp,„in and (c) expanai versus exp,„in R factors were computed using all reflections (left), low-resolution reflections only (middle) and high- 
resolution reflections only (right). See §3.3 for details. 
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All reflections 



Low-resolution reflections only 



High-resolution reflections only 




0.25 0.35 

«(poly+exp.„,|) 



0.25 0.35 

i?(poly+exp,„j,) 



0.05 



0.15 



0.25 0.35 

J?(poly+exp.„,) 



0.45 



Figure 4 

R versus R factor scatter plots comparing the new scaling protocol using poly+expanai for the anisotropic scale factor with the old protocol. For each 
structure the full set of structure factors available from the PDB was used to calculate scale factors and to calculate R factors (left). Using the same scale- 
factor values the R factors were calculated separately for the low-resolution reflections (middle) and high-resolution reflections (right). A large spread of 
points in the vertical direction above the diagonal (red line) in these latter plots indicates that in many cases the scale factors produced by the old 
protocol resulted in a poorer fit to the data at low and high resolutions, while the new protocol generates scale factors with a good fit across all resolution 
ranges. See §3.3 for details. 



0.7 



0.5 



0.3 



illustrated in Table 2. For this, the best 
values for Ucryst were determined via a 
systematic search for the minima of the 
functions (3), (11) and (46) for three 
combinations of structures and high- 
resolution cutoffs. Note the difference 
in the optimal Ucryst values and the 
corresponding R factors. 

The parameterization of the total 
model structure factor (1) does not 
make any assumption about the shape 
of ^mask; ioT cxamplc, it does not assume 
it to be exponential (10). This provides 
an opportunity to explore the behavior 
of ^mask as a functiou of resolution and 
compare it with k^^ak obtained via (10). 
Fig. 6 illustrates the differences between 
the two methods of determining k^^^k 
for six representative PDB entries 
selected from approximately 40 000 
entries after inspection of the A:^^^!^ 
values. We observe that the plots of the 
values obtained using our new approach are in general 
significantly different from the exponential function. This 
observation is in line with Fig. 1 of Urzhumtsev & Podjarny 
(1995). 

At very low resolution the structure factors computed from 
the atomic model are approximately anticorrelated to the 
structure factors computed from the bulk-solvent mask: 

F„,ask - -pFcalc- (47) 

Here, p is a scale factor (Urzhumtsev & Podjarny, 1995). 
Relation (47) is the basis for alternative bulk-solvent 
scaling methods that employ the Babinet principle (Moews & 




0.15 



Figure 5 

Plots of R factors (with A;is(„„pi(. = 0.0961) and the LS function (with Arisotropk = 0.0863) for PDB entry 
Ikwn (left) and R factors (with /cisotropic = 0.0131) and the LS function (with /cisotropk = 0.0151) for 
PDB entry Ihqw (right), illustrating that the minima of the i?-factor function (46) and the LS 
function (22) can be at significantly different locations in parameter space. In such cases, a line 
search around the value of /c^ask obtained by minimization of the LS function is necessary in order 
to obtain a value that minimizes the R factor. For plotting purposes, the values of the LS function 
were scaled to be similar to the R factors. 



Kretsinger, 1975; Tronrud, 1997). Substitution of relation (47) 
into equation (1) yields 



P ^mask)Fci 



(48) 



Obviously, F^ojei is invariant for any combination of scale 
factors /ctotai and k^^^^^k satisfying the condition 



'^totalCl P 



const. 



(49) 



Since our new scaling procedure determines /Cmask and A:iso,jopic 
(which are part of kfotai) simultaneously, without imposing 
constraints on their values, these scale factors may assume 
unusual values in the low-resolution range. However, we 



632 Afonine et a/. • Bulk-solvent and overall scaling 



Acta Cryst. (2013). D69, 625-634 



research papers 




0.03 



0.01 



0.02 



0.03 




0.04 



Figure 6 

Plots of fc,„asij as a function of resolution (s^) for six selected PDB entries. The blue lines show /Cn,ask as determined using the new method. The red lines 
show fe^ask based on the exponential function (10) using optimized k^^i and B^oi parameters. 



Table 3 

Runtime comparison for selected PDB entries. 

Absolute runtimes for the new protocol range from a few hundredths of a 
second to a second. 



PDB 


Resolution 


No. of 


No. of 


Speed 


code 


(A) 


atoms 


reflections 


gain 


lusO 


0.66 


3679 


511265 


105 


lakg 


1.10 


136 


4471 


132 


lous 


1.20 


3784 


104889 


86 


lyjp 


1.80 


66 


495 


64 


IfSt 


1.95 


3593 


28288 


104 


lavl 


4.00 


6588 


16201 


110 


IjM 


3.99 


4474 


7428 


78 


2i07 


4.0 


12157 


20412 


126 


2gsz 


4.2 


16344 


17131 


166 



observe that in practice tliis only liappens for a very small 
number of tlie test cases. 

4. Discussion 

A new method for overall anisotropic and bulk-solvent scahng 
of macromolecular crystallographic diffraction data has been 
developed which is an improvement over the existing algo- 
rithm of flat (mask-based) bulk-solvent modeling and overall 
anisotropic scaling, versions of which are routinely used in 
various refinement packages such as CNS (Brunger, 2007), 
REFMAC (Murshudov et al., 2011) and phenix.refine (Afonine 
et al., 2012). In the process of developing this method, we 
concluded that the bulk-solvent scale factor k^^sk deviates 
quite significantly from the exponential model that has tradi- 
tionally been used. This new method is approximately two 
orders of magnitude faster than the previous implementation 
and yields similar or often better R factors. Table 3 compares 
runtimes for a number of selected cases covering a broad 
range of resolutions and atomic model sizes. Therefore, the 
computational speed of the new method makes it possible to 



robustly compute bulk-solvent and anisotropic scaling para- 
meters even as part of semi-interactive procedures. 

An inherent feature of the mask-based bulk-solvent model 
is that it relies on the existing atomic model to compute the 
mask. This in turn implies that any unmodeled (as atoms) 
parts of the unit cell are considered to belong to the bulk- 
solvent region. This may obscure weakly pronounced features 
in residual maps such as partially occupied solvent or ligands. 
This is common to all mask-based bulk-solvent modeling 
methods, leading to the development of algorithms to account 
for missing atoms (Roversi et al., 2000). In the future, 
improved maps may be obtained by combining this latter 
approach with the new fast overall anisotropic and bulk- 
solvent scaling method that we have presented. 

The new method is implemented in the cctbx project 
(Grosse-Kunstleve et al, 2002) and is used in a number of 
PHENIX applications since v.1.8 of the software, most notably 
phenix.refine (Afonine et al, 2005£>, 2012), phenix.maps and 
phenix.model_vs_data (Afonine et al., 2010). The cctbx project 
is available at http://cctbx.sourceforge.net under an open- 
source license. The PHENIX software is available at http:// 
www.phenix-online.org. 

All results presented are based on PHENIX v.1.8.1. 

APPENDIX A 

Analytical derivation of a one-Gaussian approximation 
of a one-dimensional discrete data set 

Our goal is to approximate a set of data points jy(j«:)j^ = 1 with 
a Gaussian function. 



a exp(— fex^). 



(50) 



For this, we use the standard approach of minimizing a least- 
squares (LS) function, 
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LS = E[n^;)-«exp(-fc4)f. 



(51) 



If Y{xj) > 0 V xy, / = 1, A^, the minimization of LS can be 
replaced by the minimization of 



LSL = J2MYiXj)] - In[aexp(-to/)]}^ 



(52) 



The minimum of this LSL function can be determined 
analytically, 

LSL = ZlMYixj)] - ln[acxpi-bxj)]f 

;=i 

= j:Ma) - bxj - HYixj)]f. (53) 
Defining u = ln(a), Vj = xf, dj = ln[y(xy)], we obtain 

N 

LSL = ^(m - bvj - djf. (54) 

The variables {a, b] minimizing the LSL function are deter- 
mined by the condition 

9LSL 



du 
aLSL 

~lb 



= 0 



(55) 



= 0. 



This leads to 



and 



-2j:(u-bvj-dj) = 0 

7=1 

-2j:iu-bvj-d,)vj^0 



N N 

uN-bj:v^-j:d^=o 

N N N 

7=1 7=1 7=1 

-'V ,,2 , 



(56) 



(57) 



Defining p = J2j=i dj, q = Ey=i ^ = E;=i and 5 = ^jdj, 
we obtain 



and 



mA^ — bq — p — 0 
uq — br — s — Q 



u = ^ibq+p) 

b = -(uq — s). 
r 



(58) 



(59) 



From this, we obtain 



sq 



b = -(uq-s) 



N- 



(60) 



a = exp(M), i> = - (uq — s). 

r 



(61) 



and finally 
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